SemanticScuttle - klotz.me » Tags: data engineering

Tags: data engineering

Building a Data Dashboard Using the Streamlit Python Library

This article introduces Streamlit, a Python library for building data dashboards that lets Python programmers create graphical front-ends without writing CSS, HTML, or JavaScript. The author, a seasoned data engineer, explains how Streamlit and similar tools make it possible to build attractive dashboards in pure Python, marking a shift from traditional tools like Tableau or QuickSight. It is the first article in a series: this one covers Streamlit, with future articles planned on Gradio and Taipy. The author aims to reproduce similar layouts and functionality across the dashboards, using the same data in each.

2025-01-21 Tags: streamlit, python, data, dashboard, data engineering, gradio, visualization, taipy by klotz

Apache Kafka Explained

This article explains the challenges of data integration in modern systems and how Apache Kafka addresses these issues by providing a decoupled, scalable, and maintainable architecture through its publish-subscribe model. The article covers Kafka’s architecture, core concepts, and benefits for real-time data streaming and event-driven systems.
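
The decoupling that the publish-subscribe model provides can be illustrated with a small in-memory sketch. This is a toy and not the Kafka client API: the `ToyBroker` class, its `publish` and `poll` methods, and the topic and group names are all hypothetical. It only shows the core idea that producers append to a topic log while each consumer group tracks its own read position.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, NOT Kafka's API)."""

    def __init__(self):
        # topic -> append-only list of messages
        self._logs = defaultdict(list)
        # (group, topic) -> next offset that group will read
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for this group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = ToyBroker()

# The producer knows only the topic name, not who will read from it.
broker.publish("orders", {"order_id": 1})
broker.publish("orders", {"order_id": 2})

# Two independent consumer groups each receive the full stream.
billing_batch = broker.poll("billing", "orders")
analytics_batch = broker.poll("analytics", "orders")

# A second poll by the same group returns nothing new.
billing_again = broker.poll("billing", "orders")
```

Because messages stay in the log and each group keeps its own offset, a new consumer can be added without changing the producer. A real Kafka deployment adds partitions, replication, and durable storage on top of this idea.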

2025-01-11 Tags: apache kafka, pubsub, data engineering, distributed systems by klotz

10 Pandas One-Liners for Quick Data Quality Checks

These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.

| Code Snippet | Explanation |
| --- | --- |
| `df.isnull().sum()` | Counts the number of missing values per column. |
| `df.duplicated().sum()` | Counts the number of duplicate rows in the DataFrame. |
| `df.describe()` | Provides basic descriptive statistics of numerical columns. |
| `df.info()` | Displays a concise summary of the DataFrame, including each column's data type and non-null count. |
| `df.nunique()` | Counts the number of unique values per column. |
| `df.apply(lambda x: x.nunique() / x.count() * 100)` | Computes the percentage of unique values among the non-null values of each column. |
| `df.isin([value]).sum()` | Counts the occurrences of a specific value in each column. |
| `df.map(lambda x: isinstance(x, type_to_check)).sum()` | Counts the number of values of a specific type (e.g., int, str) per column. `DataFrame.map` requires pandas 2.1 or later; earlier versions use `applymap`, which is now deprecated. |
| `df.dtypes` | Lists the data type for each column in the DataFrame. |
| `df.sample(n)` | Returns a random sample of n rows from the DataFrame. |
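
As a rough illustration, several of these checks can be run on a small DataFrame. The sample data and variable names below are made up for the example. The type check is written with `apply` plus `Series.map` so that it does not depend on which pandas version provides `DataFrame.map`.

```python
import pandas as pd

# Hypothetical sample data: one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "city": ["Oslo", "Lima", "Lima", None],
    }
)

# Missing values per column.
missing_per_column = df.isnull().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Distinct non-null values per column.
unique_per_column = df.nunique()

# Share of distinct values among the non-null values, in percent.
pct_unique = df.apply(lambda col: col.nunique() / col.count() * 100)

# Occurrences of one specific value, counted per column.
lima_per_column = df.isin(["Lima"]).sum()

# Values of a given type per column (element-wise check on each column).
str_per_column = df.apply(lambda col: col.map(lambda v: isinstance(v, str))).sum()

print(missing_per_column, duplicate_rows, unique_per_column, sep="\n")
print(pct_unique, lima_per_column, str_per_column, sep="\n")
```

Each result except `duplicate_rows` is a Series indexed by column name, so a single column's figure can be read with, for example, `missing_per_column["city"]`.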

2025-01-03 Tags: pandas, data quality, one-liners, data cleaning, python, data engineering by klotz

Talk to Airflow — Build an AI Agent Using PydanticAI and Gemini 2.0

An article on building an AI agent to interact with Apache Airflow using PydanticAI and Gemini 2.0, providing a structured and reliable method for managing DAGs through natural language queries.

- Agent interacts with Apache Airflow via the Airflow REST API.
- Agent can understand natural language queries about workflows, fetch real-time status updates, and return structured data.
- Sample DAGs are implemented for demonstration purposes.
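
The "structured data" part of the points above can be sketched without any agent framework. The example below uses standard-library dataclasses as a stand-in for Pydantic models, and it is not the article's code. The endpoint path and the response field names (`dag_runs`, `dag_id`, `dag_run_id`, `state`) are assumptions based on the Airflow 2 stable REST API and should be checked against the Airflow version in use. The payload is hand-written sample data, not real output.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DagRunStatus:
    """Typed view of one DAG run (hypothetical model for this sketch)."""
    dag_id: str
    dag_run_id: str
    state: str


def dag_runs_url(base_url: str, dag_id: str) -> str:
    # Path assumed from the Airflow 2 stable REST API; verify for your version.
    return f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"


def parse_dag_runs(payload: str) -> list[DagRunStatus]:
    """Turn a JSON response body into a list of typed records."""
    data = json.loads(payload)
    return [
        DagRunStatus(run["dag_id"], run["dag_run_id"], run["state"])
        for run in data.get("dag_runs", [])
    ]


# Hand-written sample payload (assumed shape, not captured from a server).
sample_payload = json.dumps(
    {
        "dag_runs": [
            {"dag_id": "example_etl", "dag_run_id": "run_1", "state": "success"},
            {"dag_id": "example_etl", "dag_run_id": "run_2", "state": "failed"},
        ],
        "total_entries": 2,
    }
)

url = dag_runs_url("http://localhost:8080/", "example_etl")
statuses = parse_dag_runs(sample_payload)
failed_runs = [s.dag_run_id for s in statuses if s.state == "failed"]
```

Returning typed records rather than raw JSON gives the language model a fixed schema to fill in or read from, which is the reliability benefit the summary describes.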

2024-12-29 Tags: agent, apache, airflow, pydanticai, gemini 2.0, llm, data engineering, structured output, airflow dag, natural language queries, production engineering by klotz

Breser - a simple query syntax

Breser stands for Business Rules & Expression Syntax for Easy Retrieval. It is a powerful and flexible query language designed for efficient log processing and structured data filtering.

2024-12-17 Tags: breser, logs, data, filtering, query language, log analysis, query, expression, spl, production engineering, data engineering by klotz

How I’d Learn Apache Iceberg (if I Had To Start Over)

A seven-week structured self-paced study guide for learning Apache Iceberg and its ecosystem, created after the author realized its increasing relevance in the data industry.

2024-12-15 Tags: apache, iceberg, data engineering by klotz

Apache Iceberg: The Hadoop of the Modern Data Stack?

Apache Iceberg is emerging as a cornerstone for data lakes and lakehouses in the modern data stack, drawing parallels to the rise of Hadoop a decade ago. This article explores these similarities, highlighting both the opportunities and challenges that Iceberg presents for data engineering.

2024-12-15 Tags: apache, iceberg, hadoop, data lake, lakehouse, data engineering, metadata by klotz

Deep Dive into New Amazon S3 Tables

A detailed exploration of Amazon S3 Tables, a new solution for scalable storage and management of tabular data leveraging Apache Iceberg, including features, setup, security, and benefits over traditional storage methods.

2024-12-13 Tags: s3, apache iceberg, aws, tabular data, storage, data engineering by klotz

Explainable Generic ML Pipeline with MLflow

An article detailing how to build a flexible, explainable, and algorithm-agnostic ML pipeline with MLflow, focusing on preprocessing, model training, and SHAP-based explanations.

2024-11-27 Tags: mlops, pipeline, mlflow, shap, xai, data engineering, feature engineering, machine learning, eda by klotz

Apache Iceberg Won the Future — What’s Next for 2025?

The article discusses the rise of Apache Iceberg as the dominant open table format, backed by major endorsements, and outlines key developments expected for 2025 such as Role-Based Access Control (RBAC) catalogs, Change Data Capture (CDC) capabilities, and materialized views.

2024-11-20 Tags: apache iceberg, data engineering, data lakehouse by klotz
